10 Minutes to pandas

A brief 10-minute introduction to pandas

First, import the modules as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pandas data structures: Series

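A Series can be thought of as a one-dimensional array. The main difference between a Series and a plain one-dimensional array is that a Series carries an index, which relates it to another data structure common in programming: the hash (hash map).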

Create a Series. If no index is passed, pandas uses a default integer index starting from 0.

s = pd.Series([1,3,5,np.nan,6,8])
s
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
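
The comparison with a hash can be made concrete with a small sketch. The labels and values below are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with string labels instead of the default 0..n-1 index.
prices = pd.Series([1.5, 2.0, np.nan], index=['apple', 'pear', 'plum'])

# Looking a value up by label works much like a dict (hash) lookup.
pear_price = prices['pear']

# Unlike a plain dict, the Series keeps array behaviour: arithmetic is
# vectorised, and NaN propagates through the calculation.
doubled = prices * 2

# A dict converts directly to a Series; its keys become the index.
from_dict = pd.Series({'a': 1, 'b': 2})
print(pear_price, list(from_dict.index))
```

The index is what lets pandas align values by label rather than by position when Series are combined.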

pandas data structures: DataFrame

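A DataFrame is a two-dimensional data structure built by combining several Series column-wise. Each column, taken on its own, is a Series, much like data retrieved from an SQL database. It is therefore more convenient to process a DataFrame column by column, and it is worth getting used to building data by column. The strength of a DataFrame is that it handles columns of different types with ease, so questions such as how to invert a DataFrame made up entirely of floats are beside the point; for that kind of problem it is more convenient to store the data as a NumPy array or matrix.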
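
As a rough illustration of the two points above, the sketch below uses made-up data (the names `people` and `m` are hypothetical): a table with mixed column types, and a purely numeric table handed to NumPy for inversion.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type table: each column keeps its own dtype.
people = pd.DataFrame({'name': ['Ann', 'Bob'],
                       'age': [31, 45],
                       'score': [0.5, 0.25]})
age_column = people['age']          # a single column is a Series
print(type(age_column).__name__)

# For pure linear algebra, pass the numbers to NumPy instead.
m = pd.DataFrame([[2.0, 0.0], [0.0, 4.0]])
inverse = np.linalg.inv(m.to_numpy())
print(inverse)
```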

Create a DataFrame by passing a NumPy array:

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2013-01-01  0.212880  0.351725 -1.350579 -0.107403
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
2013-01-03 -0.245746 -0.226585  1.749624  1.140817
2013-01-04  0.032400 -0.264382  0.125095 -1.322739
2013-01-05 -2.260707  0.064878  0.231025  0.682991
2013-01-06  0.603739  1.490709  0.249649  1.822501

Create a DataFrame by passing a dict:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo
df2.F
0    foo
1    foo
2    foo
3    foo
Name: F, dtype: object
df2.A
0    1
1    1
2    1
3    1
Name: A, dtype: float64
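
The dict form gives every column its own type, and scalar values are repeated to match the length of the other columns. The sketch below rebuilds a subset of the df2 columns to check this; which columns to include is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"])})

# The scalar 1. is broadcast to the length of the other columns.
print(df2.shape)

# Each column reports its own dtype.
print(df2.dtypes)
```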

View the top or bottom rows of the data:

df.head(2)
                   A         B         C         D
2013-01-01  0.212880  0.351725 -1.350579 -0.107403
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
df.tail(3)
                   A         B         C         D
2013-01-04  0.032400 -0.264382  0.125095 -1.322739
2013-01-05 -2.260707  0.064878  0.231025  0.682991
2013-01-06  0.603739  1.490709  0.249649  1.822501

Display the row index, the column labels, and the underlying values:

df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[ 0.21287973,  0.35172526, -1.35057903, -0.10740265],
       [-0.85790301, -1.78332415,  1.16288782, -0.48822551],
       [-0.24574644, -0.22658458,  1.74962416,  1.14081656],
       [ 0.03240016, -0.26438175,  0.12509531, -1.32273918],
       [-2.26070679,  0.06487812,  0.23102475,  0.68299111],
       [ 0.60373902,  1.4907093 ,  0.24964875,  1.82250141]])

Show a quick statistical summary of the data:

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.419223 -0.061163  0.361284  0.287990
std    1.026018  1.061053  1.056953  1.148160
min   -2.260707 -1.783324 -1.350579 -1.322739
25%   -0.704864 -0.254932  0.151578 -0.393020
50%   -0.106673 -0.080853  0.240337  0.287794
75%    0.167760  0.280013  0.934578  1.026360
max    0.603739  1.490709  1.749624  1.822501

Transpose the data:

df.T
   2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00  2013-01-05 00:00:00  2013-01-06 00:00:00
A             0.212880            -0.857903            -0.245746             0.032400            -2.260707             0.603739
B             0.351725            -1.783324            -0.226585            -0.264382             0.064878             1.490709
C            -1.350579             1.162888             1.749624             0.125095             0.231025             0.249649
D            -0.107403            -0.488226             1.140817            -1.322739             0.682991             1.822501

Sort along an axis by its labels:

df.sort_index(axis=1,ascending=False)
                   D         C         B         A
2013-01-01 -0.107403 -1.350579  0.351725  0.212880
2013-01-02 -0.488226  1.162888 -1.783324 -0.857903
2013-01-03  1.140817  1.749624 -0.226585 -0.245746
2013-01-04 -1.322739  0.125095 -0.264382  0.032400
2013-01-05  0.682991  0.231025  0.064878 -2.260707
2013-01-06  1.822501  0.249649  1.490709  0.603739

Sort by the values in a column:

df.sort_values(by='B')
                   A         B         C         D
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
2013-01-04  0.032400 -0.264382  0.125095 -1.322739
2013-01-03 -0.245746 -0.226585  1.749624  1.140817
2013-01-05 -2.260707  0.064878  0.231025  0.682991
2013-01-01  0.212880  0.351725 -1.350579 -0.107403
2013-01-06  0.603739  1.490709  0.249649  1.822501
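
Because df holds random numbers, the effect of the two sorting calls is easier to check on a small fixed table. The frame `small` below is a hypothetical stand-in, not data from this post.

```python
import pandas as pd

# Hypothetical fixed data so the resulting order is predictable.
small = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -1.0, 0.0]},
                     index=['x', 'y', 'z'])

by_b = small.sort_values(by='B')                           # ascending by column B
by_b_desc = small.sort_values(by='B', ascending=False)     # descending by column B
cols_reversed = small.sort_index(axis=1, ascending=False)  # reorder the column labels

print(list(by_b.index), list(by_b_desc.index), list(cols_reversed.columns))
```

Both methods return a new object here; `small` itself keeps its original row order.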

Select a single column (equivalent to df.A):

df['A']
2013-01-01    0.212880
2013-01-02   -0.857903
2013-01-03   -0.245746
2013-01-04    0.032400
2013-01-05   -2.260707
2013-01-06    0.603739
Freq: D, Name: A, dtype: float64

Slice out rows with []:

df[0:3]
                   A         B         C         D
2013-01-01  0.212880  0.351725 -1.350579 -0.107403
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
2013-01-03 -0.245746 -0.226585  1.749624  1.140817
df['20130102':'20130104']
                   A         B         C         D
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
2013-01-03 -0.245746 -0.226585  1.749624  1.140817
2013-01-04  0.032400 -0.264382  0.125095 -1.322739
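
The two slices above follow different endpoint rules, which the following sketch tries to make visible. It rebuilds df with fixed integers (an assumption made here so the result does not depend on random values).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Fixed values instead of random ones, purely for illustration.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df[0:3]                   # rows 0, 1, 2: the end position is excluded
by_label = df['20130102':'20130104']    # Jan 2 through Jan 4: the end label is included

print(len(by_position), len(by_label))
```

Both slices return three rows, but only the label-based slice contains its stated endpoint.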

Select by label:

df.loc[dates[0],['A','B']]
A    0.212880
B    0.351725
Name: 2013-01-01 00:00:00, dtype: float64

Select by position:

df.iloc[1:3,0:2]
                   A         B
2013-01-02 -0.857903 -1.783324
2013-01-03 -0.245746 -0.226585
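
To compare label-based and position-based selection side by side, here is a minimal sketch. It again assumes a df filled with fixed integers rather than the random values used in this post.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Fixed values instead of random ones, purely for illustration.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row_by_label = df.loc[dates[0], ['A', 'B']]                    # one row, two columns: a Series
block_by_label = df.loc['20130102':'20130103', ['A', 'B']]     # label slice, end included
block_by_position = df.iloc[1:3, 0:2]                          # position slice, end excluded
same = block_by_label.equals(block_by_position)
scalar = df.iloc[1, 1]                                         # second row, second column

print(same, scalar)
```

Here the two blocks select the same cells, one addressed by name and the other by integer offset.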

The reindex method can add rows and columns:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
                   A         B         C         D    E
2013-01-01  0.212880  0.351725 -1.350579 -0.107403    1
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226    1
2013-01-03 -0.245746 -0.226585  1.749624  1.140817  NaN
2013-01-04  0.032400 -0.264382  0.125095 -1.322739  NaN
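
The sketch below checks two properties of the reindex call above: labels that did not exist before are filled with NaN, and the original frame is not changed. It assumes a df with fixed values in place of the random ones.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Fixed values instead of random ones, purely for illustration.
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=dates, columns=list('ABCD'))

# reindex returns a new object; the new column 'E' has no data yet, so it is all NaN.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
print(df1['E'].isna().all())

# The original frame keeps its shape and columns.
print(df.shape, df1.shape)
```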

Handling missing data:

Drop every row that contains missing data:

df1.dropna(how='any')
                   A         B         C         D  E
2013-01-01  0.212880  0.351725 -1.350579 -0.107403  1
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226  1

Fill in missing data:

df1.fillna(value=5)
                   A         B         C         D  E
2013-01-01  0.212880  0.351725 -1.350579 -0.107403  1
2013-01-02 -0.857903 -1.783324  1.162888 -0.488226  1
2013-01-03 -0.245746 -0.226585  1.749624  1.140817  5
2013-01-04  0.032400 -0.264382  0.125095 -1.322739  5

Get a boolean mask showing where data is missing:

pd.isnull(df1)
                A      B      C      D      E
2013-01-01  False  False  False  False  False
2013-01-02  False  False  False  False  False
2013-01-03  False  False  False  False   True
2013-01-04  False  False  False  False   True
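
The three missing-data calls can be tried together on a small frame. The two-column df1 below is a simplified, hypothetical stand-in for the df1 built earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: column E is missing in the last two rows.
df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                    'E': [1.0, 1.0, np.nan, np.nan]})

dropped = df1.dropna(how='any')          # keep only rows with no missing value
filled = df1.fillna(value=5)             # replace every missing value with 5
missing_per_column = pd.isnull(df1).sum()  # count the True entries of the mask per column

print(len(dropped), filled['E'].tolist())
print(missing_per_column)
```

dropna and fillna return new objects by default, so df1 still contains its NaN values afterwards.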

Reading and writing files

Write to a csv file:

df.to_csv('foo.csv')

Read from a csv file:

pd.read_csv('foo.csv')
   Unnamed: 0         A         B         C         D
0  2013-01-01  0.212880  0.351725 -1.350579 -0.107403
1  2013-01-02 -0.857903 -1.783324  1.162888 -0.488226
2  2013-01-03 -0.245746 -0.226585  1.749624  1.140817
3  2013-01-04  0.032400 -0.264382  0.125095 -1.322739
4  2013-01-05 -2.260707  0.064878  0.231025  0.682991
5  2013-01-06  0.603739  1.490709  0.249649  1.822501
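
In the output above the date index comes back as an ordinary column named "Unnamed: 0". One way to restore it, sketched here under the assumption that the first CSV column holds the index, is to pass index_col and parse_dates to read_csv. An in-memory buffer stands in for foo.csv, and df is filled with fixed values.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Fixed values instead of random ones, purely for illustration.
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=dates, columns=list('ABCD'))

buffer = io.StringIO()      # in-memory stand-in for 'foo.csv'
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first column back into the index;
# parse_dates=True asks pandas to parse that index as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored.index[0], list(restored.columns))
```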


References

10 Minutes to pandas